This started out as an analysis script, but there was still some ambiguity left in how the population descriptors were coded, so we’re going to revisit it! Yay :(

knitr::opts_knit$set(root.dir="~/OneDrive - St Vincent's Institute/Documents/RNA\ Diversity/", echo=T)
# But we also have to do this manually because R studio is stupid: 
setwd(knitr::opts_knit$get("root.dir"))
library(ggplot2)
library(data.table)
library(plyr)
library(reshape)
## 
## Attaching package: 'reshape'
## The following objects are masked from 'package:plyr':
## 
##     rename, round_any
## The following object is masked from 'package:data.table':
## 
##     melt
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange()     masks plyr::arrange()
## ✖ dplyr::between()     masks data.table::between()
## ✖ purrr::compact()     masks plyr::compact()
## ✖ dplyr::count()       masks plyr::count()
## ✖ dplyr::desc()        masks plyr::desc()
## ✖ tidyr::expand()      masks reshape::expand()
## ✖ dplyr::failwith()    masks plyr::failwith()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::first()       masks data.table::first()
## ✖ lubridate::hour()    masks data.table::hour()
## ✖ dplyr::id()          masks plyr::id()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ dplyr::last()        masks data.table::last()
## ✖ lubridate::mday()    masks data.table::mday()
## ✖ lubridate::minute()  masks data.table::minute()
## ✖ lubridate::month()   masks data.table::month()
## ✖ dplyr::mutate()      masks plyr::mutate()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ dplyr::rename()      masks reshape::rename(), plyr::rename()
## ✖ lubridate::second()  masks data.table::second()
## ✖ lubridate::stamp()   masks reshape::stamp()
## ✖ dplyr::summarise()   masks plyr::summarise()
## ✖ dplyr::summarize()   masks plyr::summarize()
## ✖ purrr::transpose()   masks data.table::transpose()
## ✖ lubridate::wday()    masks data.table::wday()
## ✖ lubridate::week()    masks data.table::week()
## ✖ lubridate::yday()    masks data.table::yday()
## ✖ lubridate::year()    masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(viridis)
## Loading required package: viridisLite
options(device = "quartz")

allSRAFinal <- readRDS("allSRAFinalTissuesCentersDisease.rds")

Let’s look at the data using the previous population descriptors and country terms.

  1. Who is sequencing who??

Where is the sequencing happening?

allSRAFinal <- allSRAFinal %>% 
    mutate(finalRace = gsub("or ", "or\n", finalRace)) %>%
    mutate(finalGeography = gsub("and ", "and\n", finalGeography)) %>%
    as.data.frame()

geographyClean <- allSRAFinal %>% drop_na(finalGeography)
raceClean <- allSRAFinal %>% drop_na(finalRace)

ggplot(geographyClean, aes(x = finalCountry, fill = finalGeography)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), alpha=0.8) +
  theme_bw() +
  ggtitle("Geography across all studies") +
  xlab("SRA Depositor location") +
  ylab("Samples (log10)") +
  scale_y_continuous(trans='log10') +
  scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position="bottom")

ggplot(raceClean, aes(x = finalCountry, fill=finalRace)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), alpha=0.8) +
  theme_bw() +
  ggtitle("Race across all studies") +
  xlab("SRA Depositor location") +
  ylab("Samples (log10)") +
  scale_y_continuous(trans='log10') +
  scale_fill_viridis(discrete=T, na.value="grey50") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position="bottom")

More clearly, who is sequencing each pop descriptor?

ggplot(geographyClean, aes(fill = finalCountry, x = finalGeography)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), alpha=0.8) +
  theme_bw() +
  ggtitle("Country of sequencing across descriptors") +
  xlab("Geographic descriptor") +
  ylab("Samples (log10)") +
  scale_y_continuous(trans='log10') +
  # scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position="bottom")

ggplot(raceClean, aes(fill = finalCountry, x = finalRace)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), alpha=0.8) +
  theme_bw() +
  ggtitle("Country of sequencing across descriptors") +
  xlab("SRA Depositor location") +
  ylab("Samples (log10)") +
  scale_y_continuous(trans='log10') +
  # scale_fill_viridis(discrete=T, na.value="grey50") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position="bottom")

2.5: Resolving ambiguity in the population descriptors:

So… after looking at this, it’s clear there’s still some noise in the descriptor classifications, primarily with who we are labeling as ‘white’ vs European… It’s worth looking at some of these studies more closely. For instance, we know Singapore and Brazil both use the term ‘white’ but it’s very unlikely to be referring to US census categories, and instead reflects their own legal recognition.

First, it’s worth remembering how the labels are made: finalGeography is made by looking at data in columns ETHNICITY, DONOR_ETHNICITY, ancestry and Population, whereas finalRace is made from RACE, primary_race, reported_race, race.ethnicity and donor_race. There’s a lot of downstream clean-up involved, which is perhaps a problem.

There are a couple of options here:

  1. Be very strict: if the column says race, then it’s a race, if the column says ethnicity, then I don’t care what your term is, it’s a geographic origin. This will overcorrect, but should perhaps have been our starting position, to see how things improve with other choices. It should also resolve the problem of studies being split across descriptor categories.
  2. Be somewhat strict: We had previously decided that certain terms (Caucasian, white, African, black) were racial no matter the column label, which is perhaps a problem, so here I have left some terms that are not ambiguous but are clearly in the wrong column unchanged (eg African American or Native Hawaiian is racial, but black is not because it is used outside the USA without reference to the USA context). This will bungle some stuff and I suspect the moving of caucasian and white and black terms will make things way worse when we consider descriptor use within a study.
  3. Manual review Leave things as they were with the original reclassification that we used, which does make some assumptions - caucasian is always racial, and white gets moved around a bit.

Let’s go in order:

Option 1: The original category label trumps all:

This method is worth exploring, to see how things look if we are strict (worth reporting in a supplement). All of the code used here and below is borrowed from 02_clean_ancestry_terms.Rmd but without most of the intermediate qc because this is meant to be brutal and automatic.

This is coded as IGR.term.strictest

popDescriptors <- read.csv("20240702_descriptor_terms_new_schema.csv")

allSRAFinal$mergedGeography <- coalesce(allSRAFinal$ancestry, allSRAFinal$Population) %>%
  coalesce(., allSRAFinal$ETHNICITY) 

allSRAFinal$mergedRace <- coalesce(allSRAFinal$RACE, allSRAFinal$race.ethnicity) %>%
  coalesce(., allSRAFinal$DONOR_ETHNICITY) %>%
  coalesce(., allSRAFinal$reported_race) %>%
  coalesce(., allSRAFinal$primary_race) 

popGeoNew <- popDescriptors[popDescriptors$IGR.coding.strictest %in% "geography",]
popRaceNew <- popDescriptors[popDescriptors$IGR.coding.strictest %in% "race",]

# And now we replace those labels with the ones in the schema:
allSRAFinal$strictestGeography <- popGeoNew[match(allSRAFinal$mergedGeography, popGeoNew$Term),]$IGR.term.strictest
allSRAFinal$strictestRace <- popRaceNew[match(allSRAFinal$mergedRace, popRaceNew$Term),]$IGR.term.strictest

# At this point I originally did a lot of swapping of terms back and forth, but I believe in the strictest case we shouldn't do too much of that, just make sure there's nothing with a double assignment:
# table(allSRAFinal$strictestGeography)
# table(allSRAFinal$strictestRace)
# 
# table(!is.na(allSRAFinal$strictestGeography), !is.na(allSRAFinal$mergedGeography))
# table(!is.na(allSRAFinal$strictestRace), !is.na(allSRAFinal$mergedRace))

# We can quickly check what becomes what, with v2 being the original and v1 being the new one:
geographyTerms <- as.data.frame(table(allSRAFinal$strictestGeography, allSRAFinal$mergedGeography)) %>% .[.$Freq > 0,]
raceTerms <- as.data.frame(table(allSRAFinal$strictestRace, allSRAFinal$mergedRace)) %>% .[.$Freq > 0,]
# geographyTerms
# raceTerms

# There's a few studies with joint assignments...
table(!is.na(allSRAFinal$strictestGeography), !is.na(allSRAFinal$strictestRace)) 
##        
##         FALSE  TRUE
##   FALSE   535 13076
##   TRUE   9397  1020
jointTerms <- as.data.frame(table(allSRAFinal$strictestRace, allSRAFinal$strictestGeography)) %>% .[.$Freq > 0,]
jointTerms
##                         Var1              Var2 Freq
## 3  Black or African American          Americas    3
## 6                      Other          Americas    2
## 7                      White          Americas   10
## 9                      Asian              Asia   39
## 16                     Asian         East Asia  246
## 24 Black or African American            Europe   10
## 28                     White            Europe  607
## 44                     Asian           Oceania    6
## 73 Black or African American Subsaharan Africa   97
mismatches <- allSRAFinal[!is.na(allSRAFinal$strictestGeography) & !is.na(allSRAFinal$strictestRace),] 
table(mismatches$SRA.Study) # Quite a lot of studies... where are they from?
## 
## SRP068551 SRP103099 SRP107326 SRP122876 SRP150833 SRP159625 SRP179998 SRP241159 SRP257773 SRP267193 SRP291048 SRP292867 SRP301528 SRP342015 SRP347253 SRP362734 
##        18        12       208        16        24        10        15       251         6         5         4        24        60        24       304        39
mismatches %>% count(finalCountry, SRA.Study, strictestGeography, strictestRace)
##    finalCountry SRA.Study strictestGeography             strictestRace   n
## 1        Brazil SRP179998           Americas Black or African American   3
## 2        Brazil SRP179998           Americas                     Other   2
## 3        Brazil SRP179998           Americas                     White  10
## 4         China SRP068551          East Asia                     Asian  18
## 5         China SRP362734               Asia                     Asian  39
## 6       Denmark SRP122876             Europe                     White  16
## 7       Germany SRP342015             Europe                     White  24
## 8         Korea SRP292867          East Asia                     Asian  20
## 9         Korea SRP292867             Europe                     White   4
## 10       Russia SRP150833             Europe                     White  24
## 11        Spain SRP301528             Europe                     White  60
## 12       Taiwan SRP107326          East Asia                     Asian 208
## 13          USA SRP103099             Europe                     White  11
## 14          USA SRP103099  Subsaharan Africa Black or African American   1
## 15          USA SRP159625             Europe Black or African American  10
## 16          USA SRP241159             Europe                     White 164
## 17          USA SRP241159  Subsaharan Africa Black or African American  87
## 18          USA SRP257773            Oceania                     Asian   6
## 19          USA SRP267193  Subsaharan Africa Black or African American   5
## 20          USA SRP291048  Subsaharan Africa Black or African American   4
## 21          USA SRP347253             Europe                     White 304
# So in the strictest case, I am making the call that if the study contained both valid labels and is outside the USA, it goes to geography, if it's inside the USA it goes to race. 

geoSRA <- unique(mismatches[!mismatches$finalCountry %in% "USA",]$SRA.Study)
raceSRA <- unique(mismatches[mismatches$finalCountry %in% "USA",]$SRA.Study)

allSRAFinal[allSRAFinal$SRA.Study %in% geoSRA,]$strictestRace <- NA
allSRAFinal[allSRAFinal$SRA.Study %in% raceSRA,]$strictestGeography <- NA
table(!is.na(allSRAFinal$strictestGeography), !is.na(allSRAFinal$strictestRace)) # Done... ish
##        
##         FALSE  TRUE
##   FALSE   535 13668
##   TRUE   9825     0
# Once we decide what to do with the hispanics, it should resolve this problem:
allSRAFinal[is.na(allSRAFinal$strictestGeography) & is.na(allSRAFinal$strictestRace),] %>% count(finalRace)
##                            finalRace   n
## 1 American Indian or\nAlaskan Native  16
## 2                               <NA> 519
allSRAFinal[is.na(allSRAFinal$strictestGeography) & is.na(allSRAFinal$strictestRace),] %>% count(finalGeography)
##   finalGeography   n
## 1           <NA> 535
allSRAFinal[is.na(allSRAFinal$strictestGeography) & is.na(allSRAFinal$strictestRace),] %>% count(hispanic)
##       hispanic   n
## 1     hispanic 387
## 2 non.hispanic 132
## 3         <NA>  16

Option 2: THe original category label trumps most things

Only if it really matches the US Census term in the geography columns, or if it really doesn’t match the census terms in the race columns, are things switched over.

This is coded as IGR.term.strict

popGeoNew <- popDescriptors[popDescriptors$IGR.coding.strict %in% "geography",]
popRaceNew <- popDescriptors[popDescriptors$IGR.coding.strict %in% "race",]

allSRAFinal$strictGeography <- popGeoNew[match(allSRAFinal$mergedGeography, popGeoNew$Term),]$IGR.term.strict
allSRAFinal$strictRace <- popRaceNew[match(allSRAFinal$mergedRace, popRaceNew$Term),]$IGR.term.strict

table(allSRAFinal$strictGeography)
## 
##                      Americas                          Asia                     East Asia                        Europe                      Multiple North Africa and Western Asia 
##                           164                           304                          1121                          5936                           206                            18 
##                         Other                    South Asia                Southeast Asia             Subsaharan Africa 
##                            62                           718                            49                           600
table(allSRAFinal$strictRace)
## 
##          American Indian and Alaska Native                                      Asian                  Black or African American                                   Multiple 
##                                         65                                        943                                       1816                                        237 
## Native Hawaiian and other Pacific Islander                                      Other                                      White 
##                                          8                                        217                                      10135
# table(!is.na(allSRAFinal$strictGeography), !is.na(allSRAFinal$mergedGeography))
# table(!is.na(allSRAFinal$strictRace), !is.na(allSRAFinal$mergedRace))

Now we have two problems - terms in the wrong column, and also double labels. Let’s start with things that had race/geography info in the relevant columns before the recode, but weren’t parsed correctly, ie we decided it was racial when it was coded as ethnicity or viceversa.

# We start with race
allSRAFinal$raceFails <- (!is.na(allSRAFinal$mergedRace) & is.na(allSRAFinal$strictRace)) 
allSRAFinal$geographyFails <- (!is.na(allSRAFinal$mergedGeography) & is.na(allSRAFinal$strictGeography)) 

raceOK <-  allSRAFinal[allSRAFinal$raceFails == FALSE,] #These got classified correctly, we think.
raceOK$rescueGeography <- raceOK$strictGeography
raceFails <- allSRAFinal[allSRAFinal$raceFails == TRUE,] 
raceFails$rescueGeography <- popGeoNew[match(raceFails$mergedRace, popGeoNew$Term),]$IGR.term.strict
allSRATemp <- rbind(raceOK, raceFails) 

# Now we do the same to the mismatches in the geography term:
geographyOK <-  allSRATemp[allSRATemp$geographyFails == FALSE,]
geographyOK$rescueRace <- geographyOK$strictRace
geographyFails <- allSRATemp[allSRATemp$geographyFails == TRUE,] 
geographyFails$rescueRace <- popRaceNew[match(geographyFails$mergedGeography, popRaceNew$Term),]$IGR.term.strict
allSRATemp <- rbind(geographyOK, geographyFails)

# allSRATemp %>% count(strictRace, rescueRace)
# allSRATemp %>% count(strictGeography, rescueGeography)

allSRATemp$strictRace <- coalesce(allSRATemp$strictRace, allSRATemp$rescueRace) 

# table(allSRATemp$finalRace)
# table(allSRATemp$strictRace) 
# 
# table(allSRATemp$finalGeography)
# table(allSRATemp$strictGeography) 

allSRAFinal <- allSRATemp
(rm(list=c("allSRATemp", "raceOK", "raceFails", "geographyOK", "geographyFails")))
## NULL
#And now we deal with things with both kinds of info:
table(!is.na(allSRAFinal$strictGeography), !is.na(allSRAFinal$strictRace)) 
##        
##         FALSE  TRUE
##   FALSE   871 13979
##   TRUE   8591   587
jointTerms <- as.data.frame(table(allSRAFinal$strictRace, allSRAFinal$strictGeography)) %>% .[.$Freq > 0,]
jointTerms
##                         Var1      Var2 Freq
## 3  Black or African American  Americas    3
## 7                      White  Americas   10
## 9                      Asian      Asia   39
## 16                     Asian East Asia  246
## 24 Black or African American    Europe   10
## 28                     White    Europe  279
mismatches <- allSRAFinal[!is.na(allSRAFinal$strictGeography) & !is.na(allSRAFinal$strictRace),] 
table(mismatches$SRA.Study) # Quite a lot of studies... where are they from?
## 
## SRP068551 SRP103099 SRP107326 SRP122876 SRP150833 SRP159625 SRP179998 SRP241159 SRP292867 SRP301528 SRP362734 
##        18        11       208        16        24        10        13       164        24        60        39
mismatches %>% count(finalCountry, SRA.Study, strictGeography, strictRace)
##    finalCountry SRA.Study strictGeography                strictRace   n
## 1        Brazil SRP179998        Americas Black or African American   3
## 2        Brazil SRP179998        Americas                     White  10
## 3         China SRP068551       East Asia                     Asian  18
## 4         China SRP362734            Asia                     Asian  39
## 5       Denmark SRP122876          Europe                     White  16
## 6         Korea SRP292867       East Asia                     Asian  20
## 7         Korea SRP292867          Europe                     White   4
## 8        Russia SRP150833          Europe                     White  24
## 9         Spain SRP301528          Europe                     White  60
## 10       Taiwan SRP107326       East Asia                     Asian 208
## 11          USA SRP103099          Europe                     White  11
## 12          USA SRP159625          Europe Black or African American  10
## 13          USA SRP241159          Europe                     White 164
# In this case we should be more nuanced, rather than simply shoving things one way or another on the basis of depositing country (although I am tempted):
# by(mismatches, mismatches$SRA.Study, function(x) head(x, n=10)) 

Some familiar names in here that I’ve had to decide on before, and new entries:

  • SRP068551 <- reported Ethnicity (Han Chinese) and Race (Asian), set to geography: East Asia only
  • SRP068551 <- reported Ethnicity and Race both as “White or Caucasian”, because they hate me. USA, so we’re going with Race: White (new)
  • SRP107326 <- reported Ethnicity (Taiwanese) and Race (Mongoloid), set to geography: East Asia only
  • SRP122876 <- reported race (Caucasian) and Population (Danish), set to geography: Europe only
  • SRP150833 <- reported ethnicity (Caucasoid) and Race (White), Russian study, set to geography: Europe (new)
  • SRP159625 <- reported ethnicity (English) and Race (Black), but reported race more often than ethnicity. Study was carried out at NIH in Bethesda, set to race: Black or African American only. No idea where ‘English’ came from.
  • SRP179998 <- reported race (various) and ethnicity (Brazilian), Brazil recognises the concept of races legally but they are not aligned with USA ones, set to geography: Americas only
  • SRP241159 <- reported race and ethnicity both as “Caucasian”. USA Study, so Race: White (new)
  • SRP292867 <- reported ethnicity (Korean/German) and race (Asian/White), set to geography only
  • SRP301528 <- reported Ethnicity (Caucasian) and race (White), Spanish study. Set to geography: Europe (new)
  • SRP362734 <- reported ethnicity (Asian) and race (Asian), data generated in China, set to geography: Asia
allSRAFinal[allSRAFinal$SRA.Study %in% c("SRP068551", "SRP107326", "SRP122876", "SRP150833", "SRP292867", "SRP179998", "SRP362734", "SRP301528"),]$strictRace <- NA
allSRAFinal[allSRAFinal$SRA.Study %in% c("SRP068551", "SRP159625", "SRP241159"),]$strictGeography <- NA

Option 3: Intensive curation:

Not repeating this, since it’s what we did to begin with. Captured in the columns finalRace and finalGeography

Now we compare all three approaches.

  1. How are samples parsed across the three levels of description?
  2. How are terms parsed across depositing country?
  3. How many studies end up split between race and geography?

We begin with the easy one:

1. How are samples parsed across the three levels of description?

Reminder that final is the most artisanal, strict is a bit more conservative and strictest does not move terms across columns even when these are patently in the wrong place.

# Let's clean up the terms again
allSRAFinal <- allSRAFinal %>% 
    mutate(strictRace = gsub("or ", "or\n", strictRace)) %>%
    mutate(strictestRace = gsub("or ", "or\n", strictestRace)) %>%
    mutate(strictRace = gsub("and ", "and\n", strictRace)) %>%
    mutate(strictestRace = gsub("and ", "and\n", strictestRace)) %>%
    mutate(strictGeography = gsub("and ", "and\n", strictGeography)) %>%
    mutate(strictestGeography = gsub("and ", "and\n", strictestGeography)) %>%
    as.data.frame()

# And let's remove the intermediate columns because they're annoying:
allSRAFinal <- allSRAFinal[,c(1:41,44:47)]

# Some big changes in White/European and Black/Subsaharan Africa if we are very strict with terms...
allSRAFinal %>% count(finalGeography, strictGeography, strictestGeography)
##                    finalGeography                strictGeography             strictestGeography     n
## 1                        Americas                       Americas                       Americas   164
## 2                            Asia                           Asia                           Asia   304
## 3                       East Asia                      East Asia                      East Asia  1103
## 4                       East Asia                           <NA>                      East Asia    18
## 5                       East Asia                           <NA>                           <NA>   212
## 6                          Europe                         Europe                         Europe  2952
## 7                          Europe                         Europe                           <NA>   304
## 8                          Europe                           <NA>                           <NA>    25
## 9                        Multiple                       Multiple                       Multiple   206
## 10 North Africa and\nWestern Asia North Africa and\nWestern Asia North Africa and\nWestern Asia    18
## 11 North Africa and\nWestern Asia                     South Asia                     South Asia     8
## 12 North Africa and\nWestern Asia                           <NA>                           <NA>     1
## 13                          Other                          Other                          Other     1
## 14                     South Asia                     South Asia                     South Asia   710
## 15                     South Asia                           <NA>                           <NA>    63
## 16                 Southeast Asia                 Southeast Asia                 Southeast Asia    49
## 17                 Southeast Asia                           <NA>                           <NA>    34
## 18              Subsaharan Africa              Subsaharan Africa              Subsaharan Africa   514
## 19              Subsaharan Africa              Subsaharan Africa                           <NA>     9
## 20              Subsaharan Africa                           <NA>                           <NA>     1
## 21                           <NA>                         Europe                         Europe  2495
## 22                           <NA>                         Europe                           <NA>    11
## 23                           <NA>                          Other                          Other    61
## 24                           <NA>              Subsaharan Africa              Subsaharan Africa    77
## 25                           <NA>                           <NA>                       Americas     2
## 26                           <NA>                           <NA>                           Asia     6
## 27                           <NA>                           <NA>                         Europe    22
## 28                           <NA>                           <NA>                       Multiple     1
## 29                           <NA>                           <NA>                        Oceania    11
## 30                           <NA>                           <NA>              Subsaharan Africa  1103
## 31                           <NA>                           <NA>                           <NA> 13543
allSRAFinal %>% count(finalRace, strictRace, strictestRace)
##                                     finalRace                                  strictRace                               strictestRace     n
## 1          American Indian or\nAlaskan Native          American Indian and\nAlaska Native          American Indian and\nAlaska Native    65
## 2          American Indian or\nAlaskan Native          American Indian and\nAlaska Native                                        <NA>     2
## 3          American Indian or\nAlaskan Native                                        <NA>                                        <NA>    16
## 4                                       Asian                                       Asian                                       Asian   658
## 5                  Black or\nAfrican American                  Black or\nAfrican American                  Black or\nAfrican American  1813
## 6                  Black or\nAfrican American                  Black or\nAfrican American                                        <NA>  1103
## 7                  Black or\nAfrican American                                        <NA>                                        <NA>    77
## 8                                    Multiple                                    Multiple                                    Multiple   237
## 9                                    Multiple                                    Multiple                                        <NA>     7
## 10 Native Hawaiian or\nother Pacific Islander Native Hawaiian and\nother Pacific Islander Native Hawaiian and\nother Pacific Islander     8
## 11 Native Hawaiian or\nother Pacific Islander Native Hawaiian and\nother Pacific Islander                                        <NA>    11
## 12                                      Other                                       Other                                       Other   217
## 13                                      Other                                        <NA>                                        <NA>    61
## 14                                      White                                       White                                       White 10021
## 15                                      White                                       White                                        <NA>    22
## 16                                      White                                        <NA>                                        <NA>  2495
## 17                                       <NA>                                        <NA>                                       Asian   309
## 18                                       <NA>                                        <NA>                  Black or\nAfrican American    10
## 19                                       <NA>                                        <NA>                                       White   330
## 20                                       <NA>                                        <NA>                                        <NA>  6566
# Repeat the same plots as above:
sampleGeography <- allSRAFinal %>% count(SRA.Study, finalGeography, strictGeography, strictestGeography) 
meltSampleGeography <- melt(sampleGeography)
## Using SRA.Study, finalGeography, strictGeography, strictestGeography as id variables
# First, by the original classification: finalGeography
meltSampleGeography %>% drop_na(finalGeography) %>% 
ggplot(., aes(x = finalGeography, y=value, fill=finalGeography)) +
  geom_bar(stat="identity", alpha=0.6) +
  theme_bw() +
  ggtitle("finalGeography across all studies") +
  # xlab("System") +
  ylab("Count") +
  # scale_y_continuous(trans='log10') +
  scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position = "none")

meltSampleGeography %>% drop_na(strictGeography) %>% 
ggplot(., aes(x = strictGeography, y=value, fill=strictGeography)) +
  geom_bar(stat="identity", alpha=0.6) +
  theme_bw() +
  ggtitle("strictGeography across all studies") +
  # xlab("System") +
  ylab("Count") +
  # scale_y_continuous(trans='log10') +
  scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position = "none")

meltSampleGeography %>% drop_na(strictestGeography) %>% 
ggplot(., aes(x = strictestGeography, y=value, fill=strictestGeography)) +
  geom_bar(stat="identity", alpha=0.6) +
  theme_bw() +
  ggtitle("strictestGeography across all studies") +
  # xlab("System") +
  ylab("Count") +
  # scale_y_continuous(trans='log10') +
  scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position = "none")

# And now the same for race:
sampleRace <- allSRAFinal %>% count(SRA.Study, finalRace, strictRace, strictestRace) 
meltSampleRace <- melt(sampleRace)
## Using SRA.Study, finalRace, strictRace, strictestRace as id variables
# First, by the original classification: finalRace
meltSampleRace %>% drop_na(finalRace) %>% 
ggplot(., aes(x = finalRace, y=value, fill=finalRace)) +
  geom_bar(stat="identity", alpha=0.6) +
  theme_bw() +
  ggtitle("finalRace across all studies") +
  # xlab("System") +
  ylab("Count") +
  # scale_y_continuous(trans='log10') +
  scale_fill_viridis(discrete=T, na.value="grey50") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position = "none")

meltSampleRace %>% drop_na(strictRace) %>% 
ggplot(., aes(x = strictRace, y=value, fill=strictRace)) +
  geom_bar(stat="identity", alpha=0.6) +
  theme_bw() +
  ggtitle("strictRace across all studies") +
  # xlab("System") +
  ylab("Count") +
  # scale_y_continuous(trans='log10') +
  scale_fill_viridis(discrete=T, na.value="grey50") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position = "none")

meltSampleRace %>% drop_na(strictestRace) %>% 
ggplot(., aes(x = strictestRace, y=value, fill=strictestRace)) +
  geom_bar(stat="identity", alpha=0.6) +
  theme_bw() +
  ggtitle("strictestRace across all studies") +
  # xlab("System") +
  ylab("Count") +
  # scale_y_continuous(trans='log10') +
  scale_fill_viridis(discrete=T, na.value="grey50") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  theme(legend.position = "none")

2: How does term assignment look by country across methods??

Geography:

allSRAFinal %>% group_by(finalCountry, finalGeography) %>% summarise(n = sum(!is.na(finalGeography))) %>% drop_na(finalGeography) %>%
  ggplot(aes(x = finalCountry, y = n, fill=finalGeography)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("finalGeography across all studies") +
    xlab("SRA Depositor location") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

allSRAFinal %>% group_by(finalCountry, strictGeography) %>% summarise(n = sum(!is.na(strictGeography))) %>% drop_na(strictGeography) %>%
  ggplot(aes(x = finalCountry, y = n, fill=strictGeography)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("strictGeography across all studies") +
    xlab("SRA Depositor location") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

allSRAFinal %>% group_by(finalCountry, strictestGeography) %>% summarise(n = sum(!is.na(strictestGeography))) %>% drop_na(strictestGeography) %>%
  ggplot(aes(x = finalCountry, y = n, fill=strictestGeography)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("strictestGeography across all studies") +
    xlab("SRA Depositor location") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

Race:

allSRAFinal %>% group_by(finalCountry, finalRace) %>% summarise(n = sum(!is.na(finalRace))) %>% drop_na(finalRace) %>%
  ggplot(aes(x = finalCountry, y = n, fill=finalRace)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("finalRace across all studies") +
    xlab("SRA Depositor location") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    scale_fill_viridis(discrete=T, na.value="grey50") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

allSRAFinal %>% group_by(finalCountry, strictRace) %>% summarise(n = sum(!is.na(strictRace))) %>% drop_na(strictRace) %>%
  ggplot(aes(x = finalCountry, y = n, fill=strictRace)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("strictRace across all studies") +
    xlab("SRA Depositor location") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    scale_fill_viridis(discrete=T, na.value="grey50") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

allSRAFinal %>% group_by(finalCountry, strictestRace) %>% summarise(n = sum(!is.na(strictestRace))) %>% drop_na(strictestRace) %>%
  ggplot(aes(x = finalCountry, y = n, fill=strictestRace)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("strictestRace across all studies") +
    xlab("SRA Depositor location") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    scale_fill_viridis(discrete=T, na.value="grey50") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

Now we flip the axes and the groupings:

allSRAFinal %>% group_by(finalCountry, finalGeography) %>% summarise(n = sum(!is.na(finalGeography))) %>% drop_na(finalGeography) %>%
  ggplot(aes(fill = finalCountry, y = n, x=finalGeography)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("finalGeography across all studies") +
    xlab("Population descriptor") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    # scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

allSRAFinal %>% group_by(finalCountry, strictGeography) %>% summarise(n = sum(!is.na(strictGeography))) %>% drop_na(strictGeography) %>%
  ggplot(aes(fill = finalCountry, y = n, x=strictGeography)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("strictGeography across all studies") +
    xlab("Population descriptor") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    # scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

allSRAFinal %>% group_by(finalCountry, strictestGeography) %>% summarise(n = sum(!is.na(strictestGeography))) %>% drop_na(strictestGeography) %>%
  ggplot(aes(fill = finalCountry, y = n, x=strictestGeography)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("strictestGeography across all studies") +
    xlab("Population descriptor") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    # scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

Race:

allSRAFinal %>% group_by(finalCountry, finalRace) %>% summarise(n = sum(!is.na(finalRace))) %>% drop_na(finalRace) %>%
  ggplot(aes(fill = finalCountry, y = n, x=finalRace)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("finalRace across all studies") +
    xlab("Population descriptor") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    # scale_fill_viridis(discrete=T, na.value="grey50") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

allSRAFinal %>% group_by(finalCountry, strictRace) %>% summarise(n = sum(!is.na(strictRace))) %>% drop_na(strictRace) %>%
  ggplot(aes(fill = finalCountry, y = n, x=strictRace)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("strictRace across all studies") +
    xlab("Population descriptor") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    # scale_fill_viridis(discrete=T, na.value="grey50") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

allSRAFinal %>% group_by(finalCountry, strictestRace) %>% summarise(n = sum(!is.na(strictestRace))) %>% drop_na(strictestRace) %>%
  ggplot(aes(fill = finalCountry, y = n, x=strictestRace)) +
    geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
    theme_bw() +
    ggtitle("strictestRace across all studies") +
    xlab("Population descriptor") +
    ylab("Samples (log10)") +
    scale_y_continuous(trans='log10') +
    # scale_fill_viridis(discrete=T, na.value="grey50") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.

3: How many studies end up with terms in multiple categories?

allSRAFinal %>% group_by(SRA.Study, finalGeography, finalRace) %>% 
  summarise(n = n()) %>% 
  group_by(SRA.Study) %>% 
  summarise(nGeo = sum(n[!is.na(finalGeography)]), nRace = sum(n[!is.na(finalRace)])) %>% 
  filter(if_all(contains("n"), ~ .x > 0)) %>%
  as.data.frame() %>%
  melt() %>%
  ggplot(., aes(x = SRA.Study, y = value, fill=variable)) +
    geom_bar(stat="identity") +
    theme_bw() +
    ggtitle("Studies with both finalRace and finalGeography assignments") +
    ylab("Count") +
    scale_fill_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    theme(legend.position="bottom")
## `summarise()` has grouped output by 'SRA.Study', 'finalGeography'. You can override using the `.groups` argument.
## Using SRA.Study as id variables

allSRAFinal %>% group_by(SRA.Study, strictGeography, strictRace) %>% 
  summarise(n = n()) %>% 
  group_by(SRA.Study) %>% 
  summarise(nGeo = sum(n[!is.na(strictGeography)]), nRace = sum(n[!is.na(strictRace)])) %>% 
  filter(if_all(contains("n"), ~ .x > 0)) %>%
  as.data.frame() %>%
  melt() %>%
  ggplot(., aes(x = SRA.Study, y = value, fill=variable)) +
    geom_bar(stat="identity") +
    theme_bw() +
    ggtitle("Studies with both strictRace and strictGeography assignments") +
    ylab("Count") +
    scale_fill_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    theme(legend.position="bottom")
## `summarise()` has grouped output by 'SRA.Study', 'strictGeography'. You can override using the `.groups` argument.
## Using SRA.Study as id variables

# Can't be plotted because there are none. 
try(allSRAFinal %>% group_by(SRA.Study, strictestGeography, strictestRace) %>% 
  summarise(n = n()) %>% 
  group_by(SRA.Study) %>% 
  summarise(nGeo = sum(n[!is.na(strictestGeography)]), nRace = sum(n[!is.na(strictestRace)])) %>% 
  filter(if_all(contains("n"), ~ .x > 0)) %>%
  as.data.frame() %>%
  melt() %>%
  ggplot(., aes(x = SRA.Study, y = value, fill=variable)) +
    geom_bar(stat="identity") +
    theme_bw() +
    ggtitle("Studies with both strictestRace and strictestGeography assignments") +
    ylab("Count") +
    scale_fill_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    theme(legend.position="bottom"))
## `summarise()` has grouped output by 'SRA.Study', 'strictestGeography'. You can override using the `.groups` argument.
## Using SRA.Study as id variables
## Error in data.frame(ids, x, data[, x]) : 
##   arguments imply differing number of rows: 0, 1

Stopping here for now - I am feeling like strictest is my favourite, actually. I had previously looked at how to resolve some of the conflicted studies (at the final level, there’s some new ones in the strict level), but hadn’t implemented it…

But now we gotta deal with one final hurdle: hispanic:

Coding hispanics one last time:

# How many studies have hispanic info and race or ethnicity info?

allSRAFinal %>% count(strictestRace, hispanic)
##                                  strictestRace     hispanic    n
## 1           American Indian and\nAlaska Native     hispanic   22
## 2           American Indian and\nAlaska Native non.hispanic    1
## 3           American Indian and\nAlaska Native         <NA>   42
## 4                                        Asian     hispanic   23
## 5                                        Asian non.hispanic  255
## 6                                        Asian         <NA>  689
## 7                   Black or\nAfrican American     hispanic   81
## 8                   Black or\nAfrican American non.hispanic  334
## 9                   Black or\nAfrican American         <NA> 1408
## 10                                    Multiple     hispanic   34
## 11                                    Multiple non.hispanic   44
## 12                                    Multiple         <NA>  159
## 13 Native Hawaiian and\nother Pacific Islander     hispanic    2
## 14 Native Hawaiian and\nother Pacific Islander non.hispanic    4
## 15 Native Hawaiian and\nother Pacific Islander         <NA>    2
## 16                                       Other     hispanic   55
## 17                                       Other non.hispanic  132
## 18                                       Other         <NA>   30
## 19                                       White     hispanic  651
## 20                                       White non.hispanic 3672
## 21                                       White         <NA> 6028
## 22                                        <NA>     hispanic  387
## 23                                        <NA> non.hispanic  166
## 24                                        <NA>         <NA> 9807
allSRAFinal %>% count(strictestGeography, hispanic)
##                strictestGeography     hispanic    n
## 1                        Americas         <NA>  166
## 2                            Asia         <NA>  310
## 3                       East Asia         <NA> 1121
## 4                          Europe non.hispanic   34
## 5                          Europe         <NA> 5435
## 6                        Multiple         <NA>  207
## 7  North Africa and\nWestern Asia         <NA>   18
## 8                         Oceania         <NA>   11
## 9                           Other         <NA>   62
## 10                     South Asia         <NA>  718
## 11                 Southeast Asia         <NA>   49
## 12              Subsaharan Africa         <NA> 1694
## 13                           <NA>     hispanic 1255
## 14                           <NA> non.hispanic 4574
## 15                           <NA>         <NA> 8374
allSRAFinal %>% count(hispanic, strictestRace)
##        hispanic                               strictestRace    n
## 1      hispanic          American Indian and\nAlaska Native   22
## 2      hispanic                                       Asian   23
## 3      hispanic                  Black or\nAfrican American   81
## 4      hispanic                                    Multiple   34
## 5      hispanic Native Hawaiian and\nother Pacific Islander    2
## 6      hispanic                                       Other   55
## 7      hispanic                                       White  651
## 8      hispanic                                        <NA>  387
## 9  non.hispanic          American Indian and\nAlaska Native    1
## 10 non.hispanic                                       Asian  255
## 11 non.hispanic                  Black or\nAfrican American  334
## 12 non.hispanic                                    Multiple   44
## 13 non.hispanic Native Hawaiian and\nother Pacific Islander    4
## 14 non.hispanic                                       Other  132
## 15 non.hispanic                                       White 3672
## 16 non.hispanic                                        <NA>  166
## 17         <NA>          American Indian and\nAlaska Native   42
## 18         <NA>                                       Asian  689
## 19         <NA>                  Black or\nAfrican American 1408
## 20         <NA>                                    Multiple  159
## 21         <NA> Native Hawaiian and\nother Pacific Islander    2
## 22         <NA>                                       Other   30
## 23         <NA>                                       White 6028
## 24         <NA>                                        <NA> 9807
# Not sure if this is due to how we coded it, but that's pleasant. Except, what to do with the people with two labels??

 saveRDS(allSRAFinal, file="20240901_allSRAFinal.rds")